Optimized text for full unicode and some escape sequences (#129169) #7

Open
MitchLewis930 wants to merge 1 commit into pr_017_before from pr_017_after

@MitchLewis930 MitchLewis930 commented Jan 29, 2026

PR_017

Summary by CodeRabbit

  • Bug Fixes

    • Improved JSON string parsing to correctly handle escape sequences and multi-byte UTF-8 characters, enhancing accuracy in content parsing.
    • Optimized string value extraction for improved performance and reliability.
  • Tests

    • Significantly expanded test coverage for JSON string handling with randomized escape sequences, Unicode codepoints, and diverse input scenarios to ensure comprehensive validation.

…9169)

Follow-up to elastic#126492 to apply the json parsing optimization to strings
containing unicode characters and some backslash-escaped characters.

Supporting backslash-escaped strings is tricky as it requires modifying the
string. There are two types of modification: some just remove the backslash
(e.g. \", \\), and some replace the whole escape sequence with a new
character (e.g. \n, \r, \u00e5). In this implementation, the optimization
only supports the first case--removing the backslash. This is done by
making a copy of the data, skipping the backslash. It should still be more
optimized than full String decoding, but it won't be as fast as 
non-backslashed strings where we can directly reference the input bytes.

Relates to elastic#129072.
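
The "copy the data, skipping the backslash" step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class `BackslashCopySketch` and the method `removeBackslashes` are hypothetical names, and the sketch assumes the backslash positions were already collected in ascending order during an earlier scan.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class BackslashCopySketch {
    /**
     * Copies input[start, end) into a new array, leaving out the bytes at the
     * given backslash positions. Positions are absolute indexes into input and
     * must be ascending (assumption: a prior scan recorded them that way).
     */
    public static byte[] removeBackslashes(byte[] input, int start, int end, List<Integer> backslashes) {
        byte[] out = new byte[end - start - backslashes.size()];
        int src = start;
        int dst = 0;
        for (int pos : backslashes) {
            int run = pos - src;                       // bytes preceding this backslash
            System.arraycopy(input, src, out, dst, run);
            dst += run;
            src = pos + 1;                             // step over the backslash only
        }
        System.arraycopy(input, src, out, dst, end - src); // tail after the last backslash
        return out;
    }

    public static void main(String[] args) {
        // Raw JSON string body: say \"hi\"  (backslashes at indexes 4 and 8)
        byte[] raw = "say \\\"hi\\\"".getBytes(StandardCharsets.UTF_8);
        byte[] clean = removeBackslashes(raw, 0, raw.length, List.of(4, 8));
        System.out.println(new String(clean, StandardCharsets.UTF_8));
    }
}
```

Copying in runs between backslashes keeps the work to a handful of `System.arraycopy` calls, which is consistent with the description's expectation that this path is cheaper than full String decoding but slower than referencing the input bytes directly.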

coderabbitai bot commented Jan 29, 2026

📝 Walkthrough

The JSON string parsing in ESUTF8StreamJsonParser is refactored from linear while-loop processing to a state machine-based approach using switch statements. This tracks escape sequences and string length separately, reconstructing output by removing escape characters when needed. Tests are rewritten with randomized input generation to validate the new parsing behavior.
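
A switch-based scan of the kind the walkthrough describes might look like the sketch below. It is an illustration under stated assumptions rather than the parser's actual code: `StringScanSketch`, `Scan`, `scan`, and `sequenceLength` are hypothetical names, only the `\"` and `\\` escapes named in the PR description are accepted, UTF-8 sequence lengths come from bit masks on the lead byte, and the length is counted in UTF-16 code units (so a 4-byte sequence counts as 2), which is an assumption about what `stringLength` represents.

```java
import java.util.ArrayList;
import java.util.List;

public class StringScanSketch {
    /** Outcome of a successful scan (hypothetical type, not from the PR). */
    public record Scan(int closingQuote, int charLength, List<Integer> backslashes) {}

    /** Byte count of the UTF-8 sequence led by b, or 0 for an invalid lead byte. */
    static int sequenceLength(int b) {
        if (b < 0x80) return 1;            // ASCII
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        return 0;                          // stray continuation or invalid byte
    }

    /**
     * Scans buf from start (just after the opening quote) up to limit.
     * Returns null whenever the optimized path does not apply, so the
     * caller can fall back to full decoding.
     */
    public static Scan scan(byte[] buf, int start, int limit) {
        List<Integer> backslashes = new ArrayList<>();
        int chars = 0;
        int ptr = start;
        while (ptr < limit) {
            int b = buf[ptr] & 0xFF;
            switch (b) {
                case '"': {
                    return new Scan(ptr, chars, backslashes); // end of string
                }
                case '\\': {
                    if (ptr + 1 >= limit) return null;        // backslash at end of buffer
                    int next = buf[ptr + 1];
                    if (next != '"' && next != '\\') return null; // escape needs replacement
                    backslashes.add(ptr);
                    ptr += 2;
                    chars += 1;
                    break;
                }
                default: {
                    int len = sequenceLength(b);
                    if (len == 0 || ptr + len > limit) return null;
                    ptr += len;
                    chars += (len == 4) ? 2 : 1;              // surrogate pair in UTF-16
                }
            }
        }
        return null; // no closing quote within the buffered bytes
    }
}
```

Returning `null` for every unsupported case keeps the fast path narrow: anything it cannot handle by slicing or by dropping backslashes is left to the general decoder.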

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| JSON Parser State Machine (libs/x-content/impl/src/main/java/.../ESUTF8StreamJsonParser.java) | Replaced linear parsing loop with switch-based state machine handling ASCII characters, escape sequences, UTF-8 multi-byte sequences, and string termination. Added stringLength and backslashes tracking; modified getValueAsText to use computed length and conditionally reconstruct output when escapes are present. |
| Parser Tests & Validation (libs/x-content/impl/src/test/java/.../ESUTF8StreamJsonParserTests.java) | Rewrote testGetValueRandomized with manual input generation using TestInput record and ESCAPE_SEQUENCES mappings. Added helper methods for random codepoint and input building. Changed assertions to scenario-based logic supporting optimized text paths, replacing prior XContentBuilder approach. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A parser transformed with clever state,
Where backslashes dance, escape sequences wait,
Switch logic bounds through UTF-8 streams,
Tests randomized fulfill their dreams—
Strings now parsed with pristine precision! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: optimization of text parsing for full Unicode and some escape sequences in the JSON parser. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@MitchLewis930
Author

@CodeRabbit full review

coderabbitai bot commented Jan 29, 2026

✅ Actions performed

Full review triggered.

@MitchLewis930
Author

@coderabbitai full review

coderabbitai bot commented Feb 12, 2026

✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@libs/x-content/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java`:
- Around line 52-57: getValueAsText() currently returns a raw slice from
_inputBuffer when stringEnd > 0 which preserves escape sequences, causing
inconsistency after _finishAndReturnText() rebuilds de-escaped bytes; modify
getValueAsText() to detect when the token contains backslashes (or when
_finishAndReturnText() previously rebuilt bytes) and return the cached
de-escaped bytes instead of the original slice: call or reuse
_finishAndReturnText() (or store its result in a field like cachedTextBytes)
when escapes are present, and in the stringEnd fast path return a Text backed by
those de-escaped bytes (use identifiers _finishAndReturnText, cached buffer
field, stringEnd, _inputBuffer, _inputPtr, stringLength, _tokenIncomplete to
locate code).
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03ba5b1 and 96300a9.

📒 Files selected for processing (2)
  • libs/x-content/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java
  • libs/x-content/impl/src/test/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParserTests.java
🔇 Additional comments (7)
libs/x-content/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java (2)

24-31: State tracking additions look good.

The added length tracking and backslash bookkeeping are appropriate for the optimized path.


74-138: Switch-based parsing and buffer rebuild logic look solid.

The state machine and the split between direct-slice and rebuilt-buffer paths are clear and maintain the optimized intent.

libs/x-content/impl/src/test/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParserTests.java (5)

23-23: LGTM.

The Locale import is used by the new escape formatting.


40-95: Test updates for optimized Text path are clear and thorough.

The assertions now explicitly validate the Text-based path across escaped and unescaped cases.


134-145: Good addition of input/result modeling for escapes.

The TestInput record and ESCAPE_SEQUENCES table make test intent explicit.


146-191: Random input generator is well-structured.

The helper methods capture both optimized and fallback paths cleanly.


194-231: Randomized test flow aligns with new optimized behavior.

The branch logic matches the expected contract for getValueAsText() vs getValueAsString().

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines 52 to 57
     public Text getValueAsText() throws IOException {
         if (_currToken == JsonToken.VALUE_STRING && _tokenIncomplete) {
             if (stringEnd > 0) {
                 final int len = stringEnd - 1 - _inputPtr;
-                // For now, we can use `len` for `stringLength` because we only support ascii-encoded unescaped strings,
-                // which means each character uses exactly 1 byte.
-                return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), len);
+                return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), stringLength);
             }

⚠️ Potential issue | 🟠 Major

Ensure repeated getValueAsText() calls return de‑escaped bytes.

When backslashes are present, _finishAndReturnText() returns a rebuilt buffer without escapes, but subsequent calls hit the stringEnd fast path and return a slice of the original input (still containing backslashes). That makes repeated calls inconsistent.

Consider caching the rebuilt bytes (or rebuilding on demand when backslashes exist) and using them in the stringEnd path.

🔧 One way to make repeated calls consistent
@@
-    private final List<Integer> backslashes = new ArrayList<>();
+    private final List<Integer> backslashes = new ArrayList<>();
+    private byte[] stringBytes;
@@
-            if (stringEnd > 0) {
-                final int len = stringEnd - 1 - _inputPtr;
-                return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), stringLength);
-            }
+            if (stringEnd > 0) {
+                if (stringBytes != null) {
+                    return new Text(new XContentString.UTF8Bytes(stringBytes), stringLength);
+                }
+                final int len = stringEnd - 1 - _inputPtr;
+                return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), stringLength);
+            }
@@
-        stringLength = 0;
-        backslashes.clear();
+        stringLength = 0;
+        backslashes.clear();
+        stringBytes = null;
@@
-            return new Text(new XContentString.UTF8Bytes(buff), stringLength);
+            stringBytes = buff;
+            return new Text(new XContentString.UTF8Bytes(buff), stringLength);
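
The caching idea in the suggestion above can be shown in isolation with a toy holder class. This is only a sketch of the pattern, not the parser: `CachedTextSketch`, `valueBytes`, `stripBackslashes`, and `rebuildCount` are hypothetical names, and the de-escaping assumes the input contains only `\"` and `\\` escapes.

```java
import java.util.Arrays;

public class CachedTextSketch {
    private final byte[] raw;   // bytes between the quotes, backslashes included
    private byte[] cached;      // de-escaped bytes, filled in by the first call
    private int rebuilds = 0;   // how many times the rebuild actually ran

    public CachedTextSketch(byte[] raw) {
        this.raw = raw;
    }

    /** Returns the de-escaped bytes; repeated calls reuse the first result. */
    public byte[] valueBytes() {
        if (cached == null) {
            cached = stripBackslashes(raw);
            rebuilds++;
        }
        return cached;
    }

    public int rebuildCount() {
        return rebuilds;
    }

    /** Drops each escaping backslash and keeps the byte that follows it. */
    static byte[] stripBackslashes(byte[] in) {
        byte[] out = new byte[in.length];
        int n = 0;
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '\\' && i + 1 < in.length) {
                i++; // skip the backslash, fall through to copy the escaped byte
            }
            out[n++] = in[i];
        }
        return Arrays.copyOf(out, n);
    }
}
```

With this shape, the first and every later call to `valueBytes()` see the same de-escaped content, which is the consistency property the review comment asks for; in the real parser the cached field would also need to be cleared when the next token is read, as the diff above does with `stringBytes = null`.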
